fix: handle tool execution timeout/error causing IllegalStateExceptio…#956

Open
chensk0601 wants to merge 9 commits into agentscope-ai:main from chensk0601:fix/951-react-agent-tool-execution-error-handling

Conversation

@chensk0601

…n (#951)

ReActAgent throws IllegalStateException when tool calls time out or fail, because no tool result is written to memory. This leaves orphaned pending tool call states that crash the agent on subsequent requests.

Root cause:

  • Tool execution timeout/error propagates without writing results to memory
  • Pending tool call state remains, blocking subsequent doCall() invocations
  • validateAndAddToolResults() throws when user message has no tool results

Changes:

  • doCall(): detect pending tool calls without user-provided results and auto-generate error results to clear the pending state
  • executeToolCalls(): add onErrorResume to catch tool execution failures and generate error tool results instead of propagating exceptions
  • Add generateAndAddErrorToolResults() helper to create error results for orphaned pending tool calls

This ensures the agent recovers gracefully from tool failures instead of crashing, and the model receives proper error feedback to continue processing.
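
The failure mode and the recovery described above can be illustrated with a small plain-Java model. This is only a sketch: `PendingToolCallSketch`, `strictCall`, and `recoveringCall` are hypothetical names, and the map stands in for agent memory rather than reproducing the actual ReActAgent code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for agent memory; not the real ReActAgent API.
class PendingToolCallSketch {
    // Tool call id -> result text. A null value marks a call that is still pending.
    private final Map<String, String> toolCalls = new LinkedHashMap<>();

    void recordToolCall(String id) {
        toolCalls.put(id, null);
    }

    List<String> pendingIds() {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : toolCalls.entrySet()) {
            if (entry.getValue() == null) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    // Behavior before the fix: pending calls without results reject every new request.
    String strictCall(String userText) {
        List<String> pending = pendingIds();
        if (!pending.isEmpty()) {
            throw new IllegalStateException("Pending tool calls without results: " + pending);
        }
        return "reply to: " + userText;
    }

    // Behavior after the fix: write an error result for each orphaned call, then proceed.
    String recoveringCall(String userText) {
        for (String id : pendingIds()) {
            toolCalls.put(id, "[ERROR] Tool execution failed: no result was recorded");
        }
        return strictCall(userText);
    }
}
```

In this model, `strictCall` keeps throwing for as long as a pending id has no result, which mirrors the "agent unusable on subsequent requests" symptom, while `recoveringCall` clears the pending state first.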

Closes #951

AgentScope-Java Version

[The version of AgentScope-Java you are working on, e.g. 1.0.9, check your pom.xml dependency version or run mvn dependency:tree | grep agentscope-parent:pom(only mac/linux)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with mvn spotless:apply
  • All tests are passing (mvn test)
  • Javadoc comments are complete and follow project conventions
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@chensk0601 chensk0601 requested a review from a team March 14, 2026 06:38
@cla-assistant

cla-assistant bot commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

@cla-assistant

cla-assistant bot commented Mar 14, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


凡勇 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@chensk0601 chensk0601 force-pushed the fix/951-react-agent-tool-execution-error-handling branch from f3080ad to 86c49aa Compare March 14, 2026 07:00
@codecov

codecov bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 63.63636% with 20 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/src/main/java/io/agentscope/core/ReActAgent.java | 63.63% | 16 Missing and 4 partials ⚠️ |

📢 Thoughts on this report? Let us know!

@chensk0601 chensk0601 force-pushed the fix/951-react-agent-tool-execution-error-handling branch 2 times, most recently from 0a3e447 to 0684fd6 Compare March 16, 2026 11:51
Collaborator

@LearningGp LearningGp left a comment

Handling tool exceptions as ToolResult seems like a solid approach. For pending tool calls where no result is provided, I’m wondering if it might be more appropriate to expose those to the developer for handling instead. Also, perhaps we could consider adding a configurable exception handler mechanism in the future? (Just a thought—this last point definitely doesn't need to block the PR).

Contributor

Copilot AI left a comment

Pull request overview

Fixes ReActAgent resiliency when tool execution fails (timeout/error) by ensuring pending tool-call state is cleared via synthetic error tool results, preventing IllegalStateException on subsequent calls.

Changes:

  • Update ReActAgent#doCall() to detect pending tool calls without user-provided tool results and auto-generate error tool results to clear pending state.
  • Update ReActAgent#executeToolCalls() to convert tool-execution failures into error tool results via onErrorResume instead of propagating exceptions.
  • Update HookStopAgentTest expectations to validate the new auto-recovery behavior (no longer expecting an exception).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java | Adds auto-recovery for orphaned pending tool calls and converts tool execution failures into error tool results. |
| agentscope-core/src/test/java/io/agentscope/core/hook/HookStopAgentTest.java | Updates tests to expect auto-recovery rather than IllegalStateException when pending tool calls exist. |

@chensk0601 chensk0601 requested a review from LearningGp March 27, 2026 03:08
chensk0601 and others added 3 commits March 27, 2026 11:17
- Extract shared buildErrorToolResult() helper to deduplicate ToolResultBlock construction
- Route generateAndAddErrorToolResults() through PostActingEvent hook pipeline for consistent tool-result lifecycle (StreamingHook TOOL_RESULT emission, hook-based transforms)
- Narrow onErrorResume catch scope to Exception.class, letting critical JVM errors (e.g. OutOfMemoryError) propagate
- Use ExceptionUtils.getErrorMessage() for non-null error messages and log the exception object itself for full stack traces
- Strengthen HookStopAgentTest auto-recovery assertions: verify error ToolResultBlock in memory, model re-invocation, and response content
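
As a rough illustration of the shared error-result helper and the narrowed catch scope listed above, here is a plain-Java analogue. The PR itself uses Reactor's `onErrorResume` and `ToolResultBlock`; the `ToolResult` class and `execute` method below are hypothetical stand-ins, and a synchronous `try`/`catch` is swapped in for the reactive operator.

```java
import java.util.concurrent.Callable;

class ErrorResultSketch {
    // Hypothetical result type; the PR works with ToolResultBlock instead.
    static final class ToolResult {
        final String toolCallId;
        final String text;
        final boolean error;

        ToolResult(String toolCallId, String text, boolean error) {
            this.toolCallId = toolCallId;
            this.text = text;
            this.error = error;
        }
    }

    // Single place that builds error results, so both recovery paths share one format.
    static ToolResult buildErrorToolResult(String toolCallId, Throwable failure) {
        // Fall back to the class name so the message is never null.
        String message = failure.getMessage() != null
                ? failure.getMessage()
                : failure.getClass().getSimpleName();
        return new ToolResult(toolCallId, "[ERROR] Tool execution failed: " + message, true);
    }

    static ToolResult execute(String toolCallId, Callable<String> tool) {
        try {
            return new ToolResult(toolCallId, tool.call(), false);
        } catch (Exception e) {
            // Only Exception is caught; Error subclasses such as OutOfMemoryError propagate.
            return buildErrorToolResult(toolCallId, e);
        }
    }
}
```

A tool that throws `TimeoutException` yields an error result the model can read, whereas a `StackOverflowError` escapes `execute` untouched.
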
Collaborator

@LearningGp LearningGp left a comment

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

@chensk0601
Author

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

@LearningGp
Collaborator

LearningGp commented Mar 27, 2026

It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException.

However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors.

It’s great to hear that this fix has been verified in production! A couple of follow-up questions:

  • Have we encountered specific scenarios where the auto-supplement logic in doCall was actually triggered?

  • In our current context, would it be possible to achieve the same goal by only modifying executeToolCalls?

@chensk0601
Author

Thanks for the thorough review! I appreciate the concern about HITL workflows — preserving the explicit contract for manual ToolResultBlock provision is indeed important.

Let me address your questions:

Regarding scenarios where doCall auto-supplement was triggered

While executeToolCalls's onErrorResume handles failures within tool execution, we've identified cases where pending states persist outside that coverage:

  1. Pre-execution failures: Network timeouts before toolkit.callTools() begins, JVM crashes, or async scheduling failures where tool calls are queued but never executed
  2. Memory recovery: When agent state is persisted and restored (e.g., after restart), orphaned pending IDs may exist without corresponding results

In these cases, executeToolCalls is never reached, so its error handling cannot apply. The doCall auto-recovery acts as a safety net for these edge cases.

On achieving the goal with only executeToolCalls modifications

For failures occurring during executeToolCalls, yes — the onErrorResume logic is sufficient. However, for the scenarios above, we need the doCall layer to handle states that originate from outside the tool execution flow.

Proposed compromise to protect HITL workflows

To address your concern about masking improper HITL usage, I suggest adding an explicit mode check:

if (providedResults.isEmpty()) {
    if (isHITLMode()) {  // Check if conversation is in HITL pause state
        // Strict behavior: require manual ToolResultBlock
        throw new IllegalStateException(
            "HITL workflow requires manual ToolResultBlock for pending IDs: " + pendingIds);
    }
    // Non-HITL: Auto-recover from orphaned states
    log.warn("Orphaned pending tool calls detected, auto-generating error results: {}", pendingIds);
    generateAndAddErrorToolResults(pendingIds);
}
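
For readers who want to try the branch above in isolation, a self-contained version might look like the following. It is a sketch under assumptions: `HitlGuardSketch` and `handlePending` are hypothetical, the `hitlMode` flag stands in for the proposed `isHITLMode()` check, and recording ids in a list replaces the real `generateAndAddErrorToolResults(pendingIds)` call.

```java
import java.util.ArrayList;
import java.util.List;

class HitlGuardSketch {
    // Stand-in for the proposed isHITLMode() check.
    private final boolean hitlMode;
    // Ids for which an error result would have been generated.
    final List<String> autoRecoveredIds = new ArrayList<>();

    HitlGuardSketch(boolean hitlMode) {
        this.hitlMode = hitlMode;
    }

    void handlePending(List<String> pendingIds, List<String> providedResults) {
        if (pendingIds.isEmpty() || !providedResults.isEmpty()) {
            return; // Nothing is orphaned, so the normal validation path applies.
        }
        if (hitlMode) {
            // Strict behavior: the caller must supply the results manually.
            throw new IllegalStateException(
                    "HITL workflow requires manual ToolResultBlock for pending IDs: " + pendingIds);
        }
        // Non-HITL: recover by generating error results for the orphaned calls.
        autoRecoveredIds.addAll(pendingIds);
    }
}
```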

@chensk0601
Author

Thanks for the thorough review! I've carefully reconsidered this, and I believe this is fundamentally a bug fix rather than an enhancement — adding configuration parameters would not be the right approach.

Why this is a critical bug (not just an enhancement)

The current behavior causes complete conversation failure on any tool execution error:

  1. First failure: Tool timeout/exception → IllegalStateException → Agent crashes
  2. Cascading effect: Pending state persists in memory
  3. All subsequent requests: Fail with same exception, making the agent unusable

This violates basic fault isolation principles — a single tool failure should not crash the entire agent permanently.

Two key reasons why auto-supplement is necessary

First, tool execution failures should not leave the agent in a permanently broken state. Whether it's a timeout, network error, or unexpected exception, the agent must recover and continue functioning. The "auto-supplement" in doCall serves as a critical safety net for orphaned pending states that originate outside executeToolCalls coverage (e.g., JVM crashes, memory recovery, pre-execution failures).

Second, manually generating error results and feeding them to the LLM is essential for proper decision-making. Without this feedback, the model has no visibility into what happened. By providing explicit error messages (e.g., "[ERROR] Tool execution failed: timeout"), we enable the model to:

  • Understand the failure context
  • Decide on alternative approaches
  • Communicate meaningfully with the user

This is not masking errors — it's proper error propagation to the model layer.

Regarding HITL concerns

The HITL workflow concern is valid, but I'd argue:

  • HITL pause/resume should be handled at a higher orchestration layer, not by leaving the agent in a corrupted state
  • If HITL requires manual ToolResultBlock, it should explicitly manage the conversation lifecycle (clearing/resuming state) rather than relying on low-level exceptions

Summary

| Approach | Result |
|---|---|
| Only executeToolCalls fix | Covers runtime failures, but leaves recovery gaps |
| Configuration switches | Adds complexity without solving the root problem |
| Current PR (both fixes) | Complete fault isolation + proper LLM feedback |

The combination ensures robust error handling while keeping the implementation clean and deterministic. I'm happy to discuss alternative HITL integration patterns if needed, but I believe the core fix should remain as-is.

Would appreciate your thoughts on this perspective.

@chensk0601 chensk0601 requested a review from LearningGp March 29, 2026 02:30
@LearningGp
Collaborator

Thanks for the feedback. I agree with your take on the fault isolation principle.

One nuance to consider: the current autocompletion doesn't differentiate between the root causes of a missing ToolResult. This is precisely why I previously mentioned the HITL (Human-in-the-loop) mechanism.

We need to decide where to propagate exceptions: to the model (via autocompletion) or to the system/developer. Model propagation works well for recoverable errors (timeouts, execution failures) where the model can retry or select an alternative tool. But for non-recoverable issues (bugs, API misuse, or other system-level failures), system-side propagation is likely more appropriate.

Refining this would require mapping out specific scenarios, so for this iteration, propagating all cases to the model seems like a reasonable baseline.
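
One possible shape for that later refinement is a small classifier that routes each failure either to the model or to the system. The sketch below is purely illustrative: `ErrorRoutingSketch`, `Route`, and the choice of which exception types count as non-recoverable are assumptions, not something this PR or the framework defines.

```java
class ErrorRoutingSketch {
    enum Route { MODEL, SYSTEM }

    // Assumed mapping: JVM errors and API-misuse exceptions go to the system,
    // everything else (timeouts, I/O failures, tool errors) goes to the model.
    static Route classify(Throwable failure) {
        if (failure instanceof Error) {
            return Route.SYSTEM;
        }
        if (failure instanceof IllegalArgumentException
                || failure instanceof IllegalStateException) {
            return Route.SYSTEM;
        }
        return Route.MODEL;
    }
}
```

Defaulting to `Route.MODEL` matches the baseline suggested for this iteration, with the system-side cases carved out explicitly.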

@chickenlj PTAL

@chensk0601
Author

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException.
However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors.
It’s great to hear that this fix has been verified in production! A couple of follow-up questions:

  • Have we encountered specific scenarios where the auto-supplement logic in doCall was actually triggered?
  • In our current context, would it be possible to achieve the same goal by only modifying executeToolCalls?

Thanks for the thorough review! I appreciate the concern about HITL workflows — preserving the explicit contract for manual ToolResultBlock provision is indeed important.
Let me address your questions:

Regarding scenarios where doCall auto-supplement was triggered

While executeToolCalls's onErrorResume handles failures within tool execution, we've identified cases where pending states persist outside that coverage:

  1. Pre-execution failures: Network timeouts before toolkit.callTools() begins, JVM crashes, or async scheduling failures where tool calls are queued but never executed
  2. Memory recovery: When agent state is persisted and restored (e.g., after restart), orphaned pending IDs may exist without corresponding results

In these cases, executeToolCalls is never reached, so its error handling cannot apply. The doCall auto-recovery acts as a safety net for these edge cases.

On achieving the goal with only executeToolCalls modifications

For failures occurring during executeToolCalls, yes — the onErrorResume logic is sufficient. However, for the scenarios above, we need the doCall layer to handle states that originate from outside the tool execution flow.

Proposed compromise to protect HITL workflows

To address your concern about masking improper HITL usage, I suggest adding an explicit mode check:

if (providedResults.isEmpty()) {
    if (isHITLMode()) {  // Check if conversation is in HITL pause state
        // Strict behavior: require manual ToolResultBlock
        throw new IllegalStateException(
            "HITL workflow requires manual ToolResultBlock for pending IDs: " + pendingIds);
    }
    // Non-HITL: Auto-recover from orphaned states
    log.warn("Orphaned pending tool calls detected, auto-generating error results: {}", pendingIds);
    generateAndAddErrorToolResults(pendingIds);
}

Thanks for the thorough review! I've carefully reconsidered this, and I believe this is fundamentally a bug fix rather than an enhancement — adding configuration parameters would not be the right approach.

Why this is a critical bug (not just an enhancement)

The current behavior causes complete conversation failure on any tool execution error:

  1. First failure: Tool timeout/exception → IllegalStateException → Agent crashes
  2. Cascading effect: Pending state persists in memory
  3. All subsequent requests: Fail with same exception, making the agent unusable

This violates basic fault isolation principles — a single tool failure should not crash the entire agent permanently.

Two key reasons why auto-supplement is necessary

First, tool execution failures should not leave the agent in a permanently broken state. Whether it's a timeout, network error, or unexpected exception, the agent must recover and continue functioning. The "auto-supplement" in doCall serves as a critical safety net for orphaned pending states that originate outside executeToolCalls coverage (e.g., JVM crashes, memory recovery, pre-execution failures).
Second, manually generating error results and feeding them to the LLM is essential for proper decision-making. Without this feedback, the model has no visibility into what happened. By providing explicit error messages (e.g., "[ERROR] Tool execution failed: timeout"), we enable the model to:

  • Understand the failure context
  • Decide on alternative approaches
  • Communicate meaningfully with the user

This is not masking errors — it's proper error propagation to the model layer.

Regarding HITL concerns

The HITL workflow concern is valid, but I'd argue:

  • HITL pause/resume should be handled at a higher orchestration layer, not by leaving the agent in a corrupted state
  • If HITL requires manual ToolResultBlock, it should explicitly manage the conversation lifecycle (clearing/resuming state) rather than relying on low-level exceptions

Summary

| Approach | Result |
| --- | --- |
| Only executeToolCalls fix | Covers runtime failures, but leaves recovery gaps |
| Configuration switches | Adds complexity without solving the root problem |
| Current PR (both fixes) | Complete fault isolation + proper LLM feedback |

The combination ensures robust error handling while keeping the implementation clean and deterministic. I'm happy to discuss alternative HITL integration patterns if needed, but I believe the core fix should remain as-is.

Would appreciate your thoughts on this perspective.

Thanks for the feedback. I agree with your take on the fault isolation principle.

One nuance to consider: the current autocompletion doesn't differentiate between the root causes of a missing ToolResult. This is precisely why I previously mentioned the HITL (Human-in-the-loop) mechanism.

We need to decide where to propagate exceptions: to the model (via autocompletion) or to the system/developer. Model propagation works well for recoverable errors (timeouts, execution failures) where the model can retry or select an alternative tool. But for non-recoverable issues (bugs, API misuse, or other system-level failures), system-side propagation is likely more appropriate.

Refining this would require mapping out specific scenarios, so for this iteration, propagating all cases to the model seems like a reasonable baseline.
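
To make the recoverable vs. non-recoverable split concrete, here is one possible shape for a follow-up classifier. Everything in it is an assumption for illustration: the `Route` enum, the `classify` method, and the choice of which exception types count as "system-side" are not part of this PR. Unknown failures default to the model, matching the baseline proposed for this iteration.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

/** Illustrative only: the categories and the Route enum are assumptions, not part of the PR. */
final class ErrorRoutingSketch {

    enum Route { MODEL, SYSTEM }

    static Route classify(Throwable failure) {
        // Transient failures: the model can retry or select an alternative tool.
        if (failure instanceof TimeoutException || failure instanceof IOException) {
            return Route.MODEL;
        }
        // Likely bugs or API misuse: surface to the developer instead.
        if (failure instanceof IllegalArgumentException
                || failure instanceof NullPointerException
                || failure instanceof UnsupportedOperationException) {
            return Route.SYSTEM;
        }
        // Baseline for everything not yet mapped: propagate to the model.
        return Route.MODEL;
    }

    public static void main(String[] args) {
        System.out.println(classify(new TimeoutException("slow tool")));
        System.out.println(classify(new IllegalArgumentException("bad schema")));
    }
}
```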

@chickenlj PTAL

Thanks for the clear guidance on error categorization. I fully agree that differentiating recoverable vs non-recoverable errors is the right long-term direction.

In the latest commit, I've made several improvements that partially address this:

  1. Hook pipeline integration: generateAndAddErrorToolResults() now routes error results through the PostActingEvent hook pipeline (same path as normal tool execution). This means developers can intercept, modify, or stopAgent() on error results via hooks — providing an extension point for custom error handling without a dedicated exception handler API.

  2. Error scope narrowing: onErrorResume is now scoped to Exception.class only, so critical JVM errors (OutOfMemoryError, StackOverflowError, etc.) propagate to the system as expected.

  3. Better diagnostics: Using ExceptionUtils.getErrorMessage() for non-null error messages and logging the full exception object (with stack trace) for production debugging.

  4. Stronger test coverage: Added assertions verifying error ToolResultBlock is written to memory, model re-invocation happens, and response content matches expectations.
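
The error scope narrowing in item 2 can be illustrated with a plain-Java analogue. This sketch uses a `try`/`catch` around a `Callable` rather than Reactor's `onErrorResume`, and the `callTool` helper is hypothetical; it only shows why catching `Exception` converts ordinary failures into error results while `Error` subclasses still propagate.

```java
import java.util.concurrent.Callable;

/** Illustrative only: a try/catch analogue of scoping recovery to Exception, not the PR's Reactor code. */
final class ScopedRecoverySketch {

    static String callTool(Callable<String> tool) {
        try {
            return tool.call();
        } catch (Exception e) {
            // Only Exception subclasses reach this branch; Error (e.g. OutOfMemoryError) is not caught.
            // Note: this simple form would also catch InterruptedException, which needs separate handling.
            return "[ERROR] Tool execution failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(callTool(() -> "42"));
        System.out.println(callTool(() -> { throw new IllegalStateException("boom"); }));
        try {
            callTool(() -> { throw new OutOfMemoryError("simulated"); });
        } catch (OutOfMemoryError e) {
            System.out.println("Error propagated: " + e.getMessage());
        }
    }
}
```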

For the error categorization refinement (routing certain exception types to system-side propagation), I think that fits well as a follow-up iteration once we have concrete scenarios mapped out. The current hook-based extensibility should cover most custom handling needs in the meantime.
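
For readers unfamiliar with the hook-based extension point, the following is a rough, self-contained model of the idea: error results pass through a chain of hooks, each of which may modify the result or request a stop. The `ActingEvent` record and `runHooks` method are hypothetical and much simpler than AgentScope's actual `PostActingEvent` pipeline, whose API is not shown in this thread.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative only: a simplified model of a post-acting hook chain, not AgentScope's hook API. */
final class ErrorHookSketch {

    record ActingEvent(String toolCallId, String resultText, boolean isError, boolean stopRequested) {
        ActingEvent withResultText(String text) {
            return new ActingEvent(toolCallId, text, isError, stopRequested);
        }

        ActingEvent requestStop() {
            return new ActingEvent(toolCallId, resultText, isError, true);
        }
    }

    /** Applies hooks in order; a hook that requests a stop ends the chain early. */
    static ActingEvent runHooks(ActingEvent event, List<UnaryOperator<ActingEvent>> hooks) {
        ActingEvent current = event;
        for (UnaryOperator<ActingEvent> hook : hooks) {
            current = hook.apply(current);
            if (current.stopRequested()) {
                break;
            }
        }
        return current;
    }

    static ActingEvent demo() {
        // Hook 1 rewrites the error text; hook 2 asks the agent to stop on any error result.
        UnaryOperator<ActingEvent> redact =
                e -> e.isError() ? e.withResultText("[ERROR] Tool unavailable") : e;
        UnaryOperator<ActingEvent> stopOnError = e -> e.isError() ? e.requestStop() : e;
        ActingEvent input =
                new ActingEvent("call_1", "[ERROR] Tool execution failed: timeout", true, false);
        return runHooks(input, List.of(redact, stopOnError));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```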

Could you take another look when you get a chance?

@LearningGp
Collaborator

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException.
However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors.
It’s great to hear that this fix has been verified in production! A couple of follow-up questions:

  • Have we encountered specific scenarios where the auto-supplement logic in doCall was actually triggered?
  • In our current context, would it be possible to achieve the same goal by only modifying executeToolCalls?

Thanks for the thorough review! I appreciate the concern about HITL workflows; preserving the explicit contract for manual ToolResultBlock provision is indeed important.
Let me address your questions:

Regarding scenarios where doCall auto-supplement was triggered

While the onErrorResume in executeToolCalls handles failures within tool execution, we've identified cases where pending states persist outside that coverage:

  1. Pre-execution failures: Network timeouts before toolkit.callTools() begins, JVM crashes, or async scheduling failures where tool calls are queued but never executed
  2. Memory recovery: When agent state is persisted and restored (e.g., after restart), orphaned pending IDs may exist without corresponding results

In these cases, executeToolCalls is never reached, so its error handling cannot apply. The doCall auto-recovery acts as a safety net for these edge cases.
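
The orphaned-state scenarios above come down to a set difference: tool call IDs that were issued but never answered. A minimal sketch, assuming the IDs have already been extracted from restored memory into two lists (the `orphanedIds` helper and its inputs are hypothetical; how AgentScope tracks pending IDs internally is not shown here):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: finds tool call IDs with no matching result after state is restored. */
final class OrphanDetectionSketch {

    static Set<String> orphanedIds(List<String> toolCallIds, List<String> toolResultIds) {
        // LinkedHashSet keeps the original call order for readable logs.
        Set<String> pending = new LinkedHashSet<>(toolCallIds);
        pending.removeAll(toolResultIds);
        return pending;
    }

    public static void main(String[] args) {
        Set<String> orphans =
                orphanedIds(List.of("call_1", "call_2", "call_3"), List.of("call_2"));
        System.out.println(orphans); // prints "[call_1, call_3]"
    }
}
```

Each ID in the returned set would then need either a user-provided result or a generated error result before the next message can be added.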

On achieving the goal with only executeToolCalls modifications

For failures occurring during executeToolCalls, yes: the onErrorResume logic is sufficient. However, for the scenarios above, we need the doCall layer to handle states that originate from outside the tool execution flow.

Proposed compromise to protect HITL workflows

To address your concern about masking improper HITL usage, I suggest adding an explicit mode check:

if (providedResults.isEmpty()) {
    if (isHITLMode()) {  // Check if conversation is in HITL pause state
        // Strict behavior: require manual ToolResultBlock
        throw new IllegalStateException(
            "HITL workflow requires manual ToolResultBlock for pending IDs: " + pendingIds);
    }
    // Non-HITL: Auto-recover from orphaned states
    log.warn("Orphaned pending tool calls detected, auto-generating error results: {}", pendingIds);
    generateAndAddErrorToolResults(pendingIds);
}


Works for me.

chensk0601 and others added 2 commits March 30, 2026 17:14
…Resume

Avoid swallowing InterruptedException in the onErrorResume handler.
In AgentScope, InterruptedException is the cooperative interruption
signal used by the agent stop policy. Converting it into an error
tool result would silently break the interruption mechanism.

Now InterruptedException is re-thrown via Mono.error() so it
propagates to AgentBase.createErrorHandler() which routes it
to handleInterrupt() as intended.
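
The ordering issue this commit addresses can be shown in plain Java. The sketch below is an analogue using `try`/`catch`, not the Reactor `Mono.error()` code from the commit, and `callTool` is a hypothetical helper: the interruption signal is re-thrown before the general handler gets a chance to convert it into an error result.

```java
import java.util.concurrent.Callable;

/** Illustrative only: a try/catch analogue of re-throwing InterruptedException before general recovery. */
final class InterruptAwareSketch {

    static String callTool(Callable<String> tool) throws InterruptedException {
        try {
            return tool.call();
        } catch (InterruptedException e) {
            // Cooperative stop signal: propagate it instead of turning it into a tool result.
            throw e;
        } catch (Exception e) {
            // Ordinary failures still become feedback for the model.
            return "[ERROR] Tool execution failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(callTool(() -> { throw new IllegalStateException("boom"); }));
            callTool(() -> { throw new InterruptedException("stop requested"); });
        } catch (InterruptedException e) {
            System.out.println("Interrupted: " + e.getMessage());
        }
    }
}
```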
@chensk0601 chensk0601 requested a review from LearningGp March 30, 2026 09:28
Collaborator

@LearningGp LearningGp left a comment

LGTM

Development

Successfully merging this pull request may close these issues.

[Bug]: java.lang.IllegalStateException: Cannot add messages without tool results when pending tool calls exist. Pending IDs: [call_xxx]
